Potsdam Commentary Corpus 2.0: Annotation for Discourse Research
نویسندگان
چکیده
We present a revised and extended version of the Potsdam Commentary Corpus, a collection of 175 German newspaper commentaries (op-ed pieces) that has been annotated with syntax trees and three layers of discourse-level information: nominal coreference, connectives and their arguments (similar to the PDTB, (Prasad et al., 2008)), and trees reflecting discourse structure according to Rhetorical Structure Theory (Mann and Thompson, 1988). Connectives have been annotated with the help of a semi-automatic tool (Conano, (Stede and Heintze, 2004)) that identifies most connectives and suggests arguments based on their syntactic category. The other layers have been created manually with dedicated annotation tools. The corpus is made available on the one hand as a set of original XML files produced with the annotation tools, based on identical tokenization. On the other hand, it will be distributed together with the open-source linguistic database ANNIS3 (Chiarcos et al., 2008; Zeldes et al., 2009), which provides multi-layer search functionality and layer-specific visualization modules. This allows for comfortable qualitative evaluation of the correlations between annotation layers.
منابع مشابه
Information structure in the Potsdam Commentary Corpus: Topics
The Potsdam Commentary Corpus is a collection of 175 German newspaper commentaries annotated on a variety of different layers. This paper introduces a new layer that covers the linguistic notion of information-structural topic (not to be confused with ‘topic’ as applied to documents in information retrieval). To our knowledge, this is the first larger topic-annotated resource for German (and on...
متن کاملHandbuch Textannotation
The Potsdam Commentary Corpus is a collection of newspaper texts belonging to the ‚commentary‘ genre. The public part consists of 175 texts from Märkische Allgemeine Zeitung that have been manually annotated for syntax, coreference, connectives, and rhetorical structure. Further layers will be added to future releases of the corpus. This book assembles the annotation guidelines that have been u...
متن کاملFrom Sentence to Discourse: Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank
The present paper reports on a preparatory research for building a language corpus annotation scenario capturing the discourse relations in Czech. We primarily focus on the description of the syntactically motivated relations in discourse, basing our findings on the theoretical background of the Prague Dependency Treebank 2.0 and the Penn Discourse Treebank 2. Our aim is to revisit the present-...
متن کاملThe Potsdam Commentary Corpus
A corpus of German newspaper commentaries has been assembled and annotated with different information (and currently, to different degrees): part-of-speech, syntax, rhetorical structure, connectives, co-reference, and information structure. The paper explains the design decisions taken in the annotations, and describes a number of applications using this corpus with its multi-layer annotation.
متن کاملThe Penn Discourse TreeBank 2.0
We present the second version of the Penn Discourse Treebank, PDTB-2.0, describing its lexically-grounded annotations of discourse relations and their two abstract object arguments over the 1 million word Wall Street Journal corpus. We describe all aspects of the annotation, including (a) the argument structure of discourse relations, (b) the sense annotation of the relations, and (c) the attri...
متن کامل